ABSTRACT

Social networking websites allow users to create and share content. 
Big information cascades of post resharing can form as users of 
these sites reshare others’ posts with their friends and followers. 
One of the central challenges in understanding such cascading
behaviors is in forecasting information outbreaks, where a single post
becomes widely popular by being reshared by many users. 

In this paper, we focus on predicting the final number of reshares 
of a given post. We build on the theory of self-exciting point
processes to develop a statistical model that allows us to make
accurate predictions. Our model requires no training or expensive
feature engineering. It results in a simple and efficiently computable
formula that allows us to answer questions, in real-time, such as: 
Given a post’s resharing history so far, what is our current estimate 
of its final number of reshares? Is the post resharing cascade past 
the initial stage of explosive growth? And, which posts will be the 
most reshared in the future? 

We validate our model using one month of complete Twitter data 
and demonstrate a strong improvement in predictive accuracy over 
existing approaches. Our model gives only 15% relative error in
predicting the final size of an average information cascade after
observing it for just one hour.

Categories and Subject Descriptors: H.2.8 [Database Management]:
Database applications - Data mining
General Terms: Algorithms; Experimentation. 

Keywords: information diffusion; cascade prediction; self-exciting 
point process; contagion; social media. 

1. INTRODUCTION 

Online social networking services, such as Facebook, YouTube,
and Twitter, allow their users to post and share content in the form
of posts, images, and videos. As a user is exposed
to posts of others she follows, the user may in turn reshare a post
with her own followers, who may further reshare it with their
respective sets of followers. This way large information cascades of
post resharing spread through the network.

KDD’15, August 10-13, 2015, Sydney, NSW, Australia. 

© 2015 ACM. ISBN 978-1-4503-3664-2/15/08 ...$15.00. 

DOI: http://dx.doi.org/10.1145/2783258.2783401 


A fundamental question in modeling information cascades is to 
predict their future evolution. Arguably the most direct way to
formulate this question is to consider predicting the final size of an
information cascade. That is, to predict how many reshares a given 
post will ultimately receive. 

Predicting the ultimate popularity of a post is important for
content ranking and aggregation. For instance, Twitter is overflowing
with posts and users have a hard time keeping up with all of them.
Thus, much of the content gets missed and eventually lost. Accurate
prediction would allow Twitter to rank content better, discover
trending posts faster, and improve its content-delivery networks.
Moreover, predicting information cascades allows us to gain
fundamental insights into the predictability of collective behaviors where
uncoordinated actions of many individuals lead to spontaneous
outcomes, for example, large information outbreaks.

Most research on predicting information cascades involves
extracting an exhaustive set of features describing the past evolution
of a cascade and then using these features in a simple machine
learning classifier to make a prediction about future growth.
However, feature extraction can be expensive and
cumbersome, and one is never sure if more effective features could
be extracted. The question remains how to design a simple and
principled bottom-up model of cascading behavior. The challenge
lies in defining a model for an individual’s behavior and then
aggregating the effects of the individuals in order to make an accurate
global prediction.

Present work. Here we focus on predicting the final size of an
information cascade spreading through a network. We develop a
statistical model based on the theory of self-exciting point processes.
A point process indexed by time is called a counting process when
it counts the number of instances (reshares, in our case) over time.
In contrast to homogeneous Poisson processes, which assume
constant intensity over time, self-exciting processes assume that all the
previous instances (i.e., reshares) influence the future evolution of
the process. Self-exciting point processes are frequently used to
model “rich get richer” phenomena. They are ideal
for modeling information cascades in networks because every new
reshare of a post not only increases its cumulative reshare count by
one, but also exposes new followers who may further reshare the
post.

We develop Seismic (Self-Exciting Model of Information
Cascades) for predicting the total number of reshares of a given post.
In our model, each post is fully characterized by its infectiousness,
which measures the reshare probability. We allow the
infectiousness to vary freely over time, in agreement with the observation that
the infectiousness can drop as the content gets stale (see Figure 1).

Figure 1: First 6 hours of retweeting activity of a popular
tweet (top). The controversial tweet is about the fresh death
of dictator Muammar Gaddafi and mentions singer Justin
Bieber. Interestingly, the car manufacturer Chevrolet Twitter
account inappropriately retweeted the tweet about 30 minutes
after the original tweet, which possibly led to the tweet’s sustained
popularity. Tweet infectiousness against time as estimated by
Seismic (middle). Predictions of the tweet’s final retweet count
(denoted as “Truth”) as a function of time (bottom). We
compare Seismic with time series linear regression (LR). “Observed”
plots the cumulative number of observed retweets by a
given time. Notice Seismic quickly finds an accurate estimate
of the tweet’s final retweet count.


Moreover, our model is able to identify at each time point whether 
the cascade is in the supercritical or subcritical state, based on 
whether its infectiousness is above or below a critical threshold. 
A cascade in the supercritical state is going through an “explosion” 
period and its final size cannot be predicted accurately at the
current time. In contrast, a cascade is tractable if it is in the
subcritical state. In this case, we are able to predict its ultimate popularity
accurately by modeling the future cascading behavior by a
Galton-Watson tree.

Our Seismic approach makes several contributions: 

• Generative model: Seismic imposes no parametric
assumptions and requires no expensive feature engineering.
Moreover, as complete social network structure may be hard to
obtain, Seismic assumes minimal knowledge of the network:
The only required input is the time history of reshares and
the degrees of the resharing nodes.

• Scalable computation: Making a prediction using Seismic
only requires computational time linear in the number of
observed reshares. Since predictions for individual posts can be
made independently, our algorithm can also be easily
parallelized.

• Ease of interpretation: For an individual cascade, the model
synthesizes all its past history into a single infectiousness
parameter. This infectiousness parameter holds a clear
meaning, and can serve as input to other applications.

We evaluate Seismic on one month of complete Twitter data, 
where users post tweets which others can then reshare by
retweeting them. We demonstrate that Seismic is able to predict the
final retweet count of a given tweet with 30% better accuracy than
the state-of-the-art approaches. For reasonably
popular tweets, our model achieves 15% relative error in predicting the
final retweet count after observing the tweet for 1 hour, and 25% 
error after observing the tweet for just 10 minutes. Moreover, we 
also demonstrate how Seismic is able to identify tweets that will 
go “viral” and be among the most popular tweets in the future. By 
maintaining a dynamic list of 500 tweets over time, we are able to 
identify 78 of the 100 most reshared tweets and 281 of the 500 most 
reshared tweets in just 10 minutes after they are posted. 

The rest of the paper is organized as follows: Section 2
surveys the related work. Section 3 describes Seismic, and Section 4
shows how the model can be used to predict the final size of an
information cascade. We evaluate our method and compare its
performance with a number of baselines as well as state-of-the-art
approaches in Section 5. Last, in Section 6 we conclude and discuss
future research directions.


2. RELATED WORK 

The study of information cascades is a rich and active field.
Recent models for predicting the size of information cascades are
generally characterized by two types of approaches: feature based
methods and point process based methods.

Feature based methods first extract an exhaustive list of
potentially relevant features, including content features, original poster
features, network structural features, and temporal features. Then
different learning algorithms are applied, such as simple regression
models, probabilistic collaborative filtering, regression
trees, content-based models, and passive-aggressive
algorithms. There are several issues with such approaches:
laborious feature engineering and extensive training are crucial for their
success, and the performance is highly sensitive to the quality of
the features. Such approaches also have limited
applicability because they cannot be used in real-time online settings: given
the massive amount of posts being produced every second, it is
practically impossible to extract all the necessary features for every
post and then apply complicated prediction rules. In contrast,
Seismic requires no feature engineering and results in an efficiently
computable formula that allows it to predict the final popularity of
millions of posts as they are spreading through the network.

The second type of approach is based on point processes, which
directly model the formation of an information cascade in a
network. Such models were mostly developed for the complementary
problem of network inference, where one observes a number of
information cascades and tries to infer the structure of the underlying
network over which the cascades propagated.
These methods have been successfully applied to study the
spread of memes on the web as well as hashtags on
Twitter. In contrast, our goal is not to infer the network but to
predict the ultimate size of a cascade in an observed network.

Symbol                Description
$w$                   Post/information cascade
$p_t$                 Infectiousness of $w$ at time $t$ (Section 3.2)
$\phi(s)$             Memory kernel (Section 3.1)
$i$                   Node that contributed the $i$-th reshare; $i = 0$ corresponds to the originator of the post
$t_i$                 Time of the $i$-th reshare relative to the original post
$n_i$                 Out-degree of the $i$-th node
$R_t$                 Cumulative popularity by time $t$: $|\{i > 0;\ t_i \le t\}|$
$R_\infty$            Final popularity (final number of reshares): $|\{i > 0\}|$
$N_t$                 Cumulative degree of resharers by time $t$: $\sum_{i:\, t_i \le t} n_i$
$N_t^e$               Effective cumulative degree of resharers by time $t$: $N_t^e = \sum_{i=0}^{R_t} n_i \int_{t_i}^{t} \phi(s - t_i)\, ds$
$\lambda_t$           Intensity of cumulative popularity $R_t$
$\hat{p}_t$           Model’s estimate of infectiousness $p_t$ at time $t$
$\hat{R}_\infty(t)$   Model’s estimate at time $t$ of final popularity $R_\infty$

Table 1: Table of symbols.

A major distinction between our model and existing methods
based on Hawkes processes is that we
assume the process intensity $\lambda_t$ depends on another stochastic
process $p_t$, the post infectiousness. In other words, we allow the
infectiousness to change over time. Moreover, some of these
methods rely on computationally expensive Bayesian inference,
while our method has linear time complexity. Another recently
proposed related method also takes the point process
approach and directly aims to predict tweet popularity. However,
this method makes restrictive parametric assumptions and does not
consider the network structure, which limits its predictive ability.
We compare Seismic with this method in Section 5 and demonstrate a
30% improvement.

3. MODELING INFORMATION CASCADES 

In this section, we describe Seismic and discuss how it can be 
used for: 

1. Estimating the spreading rate of a given information cascade, 
which we quantify by the post’s infectiousness. 

2. Determining whether the cascade is in supercritical (explosive)
or subcritical (dying out) state.

3. Predicting the final size of an information cascade, which is 
measured by the ultimate number of reshares received by the 
post that started the cascade. 

Important quantities in our model are the total number of
reshares $R_t$ of a given post up to time $t$ and the cascade's speed of
spreading $\lambda_t$. In our model, $\lambda_t$ is determined by the post
infectiousness $p_t$ and human reaction time. Our goal is to predict $R_\infty$,
the final number of reshares.

Another important quantity in our model is the memory kernel
$\phi(s)$, which quantifies the delay between a post arriving in a user’s
feed and the user resharing it. Intuitively, infectiousness defines the
probability that a given user will reshare a given post, and the
memory kernel models the user’s reaction time. By combining the two we
can then accurately model the speed at which the post will spread
through the network. Table 1 summarizes the notation.

3.1 Human reaction time 

In order to predict the cascade size, we need to know how long
it takes for a person to reshare a post. Knowing the delay allows
us to accurately model the speed of a cascade spreading through
the network. We assume that the time $s$ between the arrival of
a post in a user’s timeline and a reshare of the post by the user
is distributed with density $\phi(s)$. The probability density $\phi(s)$ is
also called a memory kernel because it measures a physical/social
system’s memory of stimuli.

The distribution of human response time $\phi(s)$ has been shown to
be heavy-tailed in social networks. Usually the tail of $\phi(s)$ is
assumed to follow a power-law with exponent between 1 and 2 or
a log-normal distribution. However, due to the rapid nature
of information sharing on Twitter, it is also natural to expect many
instant reaction times. In fact, our exploratory data analysis in
Section 5.2 confirms that on Twitter, $\phi(s)$ is approximately constant for
the first 5 minutes and is then followed by a power-law decay.
Different social networks may have different distributions of human
reaction times. However, $\phi(s)$ only needs to be estimated once per
network and thus we can safely assume it is given. We describe a
detailed estimation procedure of $\phi(s)$ in Section 5.2.

3.2 Post infectiousness 

The second component of our model is the post infectiousness.
We assume each post $w$ is associated with a time dependent,
intrinsic infectiousness parameter $p_t(w)$. In other words, $p_t(w)$ models
how likely the post $w$ is to be reshared at time $t$. Infectiousness
of a post may depend on a combination of factors, including but
not limited to the quality of the post’s content, the social network
structure, the current local time, and the geographical location.
Instead of assuming a parametric form of $p_t$, we model it flexibly in a
nonparametric way which implicitly accounts for all these factors.

Most existing methods studying self-exciting point processes
assume $p_t$ to be fixed over time. Consequently, an important concept
is the criticality of the process $R_t$. In a self-exciting point process
with constant infectiousness $p_t = p$, there exists a phase transition
phenomenon at a certain critical threshold $p^*$ such that:

1. If $p > p^*$, then $R_t \to \infty$ as $t \to \infty$ almost surely and
exponentially fast. This is called the supercritical regime.

2. If $p < p^*$, then $\sup_t R_t < \infty$ almost surely. This is called
the subcritical regime.

In reality, $R_t$ is always bounded due to the finite size of the
network. Thus, no supercritical cascades can exist if $p_t$ is assumed to
be a constant. This is inadequate to model highly contagious tweets,
and our assumption of non-constant infectiousness solves this
problem as well. Furthermore, as the post gets older the information
becomes outdated and its spreading power (infectiousness) may
decrease. This effect may also be observed as the post spreads further
away from the original poster. Alternatively, resharing by a
highly influential user may increase the post’s infectiousness. Thus,
rather than assuming a common evolutionary pattern of $p_t$ for all
the tweets, we only assume it varies smoothly over time and use
non-parametric methods to estimate $p_t$ for each tweet.


3.3 The SEISMIC model 

We combine human reaction times and post infectiousness to
derive Seismic. In order to link $p_t$ to the post resharing process
$R_t$, we model $R_t$ as a doubly stochastic self-exciting point process.
This is an extension of the standard self-exciting point process (also
called the Hawkes process), which was initially used to model
earthquakes.

We first define the intensity $\lambda_t$ of $R_t$, which simply measures the
rate of obtaining an additional reshare at time $t$. More formally:

$$\lambda_t = \lim_{\Delta \downarrow 0} \frac{\mathbb{P}(R_{t+\Delta} - R_t = 1)}{\Delta}.$$

In Seismic, the intensity $\lambda_t$ at time $t$ is determined by
infectiousness $p_t$, reshare times $t_i$, node degrees $n_i$, and human reaction
time distribution $\phi(s)$. The exact relationship described in Eq. (1)
is inspired by the theory of Hawkes processes:

$$\lambda_t = p_t \cdot \sum_{t_i \le t,\ i \ge 0} n_i\, \phi(t - t_i), \qquad t \ge t_0. \tag{1}$$

Intuitively, $\sum_{t_i \le t,\ i \ge 0} n_i\, \phi(t - t_i)$ is the intensity of the arrival
of newly exposed users at time $t$, and its product with the resharing
probability $p_t$ gives the intensity of reshares at time $t$.

Note that the above point process is called self-exciting because
each previous observation $i$ such that $t_i < t$ contributes to the
intensity $\lambda_t$, or equivalently, each observation increases the intensity
in the future. It is further doubly stochastic (or a Cox process)
because the infectiousness $p_t$ is itself a stochastic process.
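
As an illustration, the intensity in Eq. (1) can be evaluated directly from the reshare history. The following is a minimal Python sketch, not the authors' implementation. The names `memory_kernel` and `intensity` are hypothetical, the kernel shape (a 5-minute plateau followed by power-law decay with exponent 0.242) follows the description in Sections 3.1 and 5.2, and the constant `C` is an assumption of this sketch: it is derived from the requirement that the kernel integrates to one rather than taken from a fitted value.

```python
from typing import Sequence

S0 = 300.0     # length of the constant plateau in seconds (5 minutes)
THETA = 0.242  # power-law decay exponent
# Assumption of this sketch: c is fixed by requiring phi to integrate to 1.
C = THETA / (S0 * (1.0 + THETA))


def memory_kernel(s: float) -> float:
    """phi(s): constant on (0, S0], power-law decay afterwards, 0 for s <= 0."""
    if s <= 0:
        return 0.0
    if s <= S0:
        return C
    return C * (s / S0) ** (-(1.0 + THETA))


def intensity(t: float, p_t: float,
              times: Sequence[float], degrees: Sequence[int]) -> float:
    """lambda_t = p_t * sum over events with t_i <= t of n_i * phi(t - t_i)."""
    exposure = sum(n * memory_kernel(t - ti)
                   for ti, n in zip(times, degrees) if ti <= t)
    return p_t * exposure
```

Here `times[0]` and `degrees[0]` describe the original post, so a post by a user with 1000 followers and infectiousness 0.01 has intensity `10 * C` reshares per second during the first five minutes, before any reshare occurs.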

Additionally, we assume node degrees $\{n_i\}$ are independent and
identically distributed with mean $n_*$. The mean degree $n_*$ is related to
the critical threshold $p^*$ discussed in Section 3.2.
The critical infectiousness threshold takes value $p^* = 1/n_*$. We
give the proof of this fact in Proposition 4.1.


The denominator in Eq. (3), denoted as $N_t^e$ hereafter, can be
interpreted as the accumulative “effective” number of users exposed
to the post. The numerator $R_t$ is the current number of reshares
of the post. To shed more light on our estimator, we take $t \to \infty$,
which leads to:

$$\hat{p}_\infty = \frac{R_\infty}{\sum_{i=0}^{R_\infty} n_i} \approx \frac{1}{n_*}. \tag{4}$$

Thus, by assuming the infectiousness $p_t$ to be constant over time,
one would essentially assume that most posts have the same
infectiousness $1/n_*$. However, such an assumption is unrealistic as it
cannot explain the bursty and volatile dynamics of information cascades
(e.g., Figure 1).

This undesirable consequence of assuming constant $p_t$ is another
motivation for allowing $p_t$ to vary over time. To estimate $p_t$ in this
case, we smooth the MLE in Eq. (3) by only using observations
close to time $t$ to estimate $p_t$. In particular, we rely on a sequence
of one-sided kernels $K_t(s)$, $s > 0$, indexed by time $t$. We use
these kernels to weight the reshares, and the weighted estimate of
$p_t$ is given by Eq. (5).

4. PREDICTING INFORMATION CASCADES 

In this section we describe how to perform statistical inference
for the self-exciting model of information cascades introduced in
the previous section. Specifically, we discuss how Seismic
estimates the infectiousness parameter $p_t$ and then predicts the ultimate
size of the cascade $R_\infty$.

Throughout this section, we make a technical assumption that
the followers of all the resharers are disjoint, so we can use a tree
structure to describe the information diffusion (Figure 2). The
conclusions made in this section remain valid even if the followers of
the resharers are not disjoint. In this case, we can replace the node
degree $n_i$ with the total number of newly exposed neighbors of node $i$
(the followers of the $i$-th resharer who do not follow the first $i - 1$
resharers).


4.1 Estimating post infectiousness 

We first define the sample-function density, which plays a central
role in estimating self-exciting point processes. Let us denote by
$\mathcal{F}_t = \sigma(\{(t_i, n_i) : t_i \le t\})$ the $\sigma$-algebra generated by all the
information available by time $t$: the times $t_i$ of all the reshares up
to time $t$ and the number of followers (i.e., node degree) $n_i$ of the
$i$-th user to reshare. The sample-function density is defined as the joint
probability of the number of reshares in the time interval $[t_0, t)$ and
the density of their occurrence times.

To motivate our estimator of pt, we first consider the case where 
the infectiousness parameter remains constant over time, i.e., pt = 
p. Later we will relax this assumption and allow pt to vary over 
time. 

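In Seismic, the sample-function density can be expressed using
the intensity $\lambda_t$ as [Thm. 6.2.2]

$$\mathbb{P}(R_t = r, t_1, \ldots, t_r) = \prod_{i=1}^{R_t} \lambda_{t_i} \cdot \exp\left\{ -\int_{t_0}^{t} \lambda_s\, ds \right\}. \tag{2}$$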


By taking the derivative of the log of Eq. (2) and combining it with
Eq. (1), we obtain the maximum likelihood estimate (MLE) of $p_t$:

$$\hat{p}_t = \frac{R_t}{\sum_{i=0}^{R_t} n_i \int_{t_i}^{t} \phi(s - t_i)\, ds}. \tag{3}$$


The above equation forms the basis of Seismic as it allows us
to estimate the infectiousness $p_t$ at any given time $t$. Moreover, a
confidence interval for $p_t$ can also be obtained.
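
The constant-infectiousness MLE in Eq. (3) is a ratio of the reshare count to the effective exposure. The sketch below is illustrative only and rests on stated assumptions: the function names are hypothetical, `times[0]` and `degrees[0]` are taken to describe the original post, and the memory kernel is the plateau-plus-power-law form of Section 5.2 with its constant derived from normalization (not a fitted value), which gives the integral in the denominator a closed form.

```python
def kernel_cdf(x, s0=300.0, theta=0.242):
    """Closed-form integral of phi from 0 to x for the plateau/power-law kernel.

    Assumption: c is chosen so that phi integrates to 1 over (0, infinity).
    """
    c = theta / (s0 * (1.0 + theta))
    if x <= 0:
        return 0.0
    if x <= s0:
        return c * x
    return c * s0 + (c * s0 / theta) * (1.0 - (x / s0) ** (-theta))


def mle_infectiousness(t, times, degrees):
    """p_hat = R_t / N_t^e, using only events with t_i <= t."""
    observed = [(ti, n) for ti, n in zip(times, degrees) if ti <= t]
    reshares = len(observed) - 1  # R_t excludes the original post (i = 0)
    effective = sum(n * kernel_cdf(t - ti) for ti, n in observed)
    if effective == 0:
        raise ValueError("no effective exposure yet; choose t > t_0")
    return reshares / effective
```

For example, `mle_infectiousness(300.0, [0.0, 100.0], [1000, 10])` divides one observed reshare by the effective number of followers who have had time to react within the first five minutes.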


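The kernel-weighted estimate of $p_t$ is:

$$\hat{p}_t = \frac{\int_{t_0}^{t} K_t(t - s)\, dR_s}{\int_{t_0}^{t} K_t(t - s)\, dN_s^e} = \frac{\sum_{i=1}^{R_t} K_t(t - t_i)}{\sum_{i=0}^{R_t} n_i \int_{t_i}^{t} K_t(t - s)\, \phi(s - t_i)\, ds}. \tag{5}$$

Notice that when $K_t(s) \equiv 1$ the estimator reduces to the MLE
we derived in Eq. (3). In SEISMIC we use a triangular kernel with
growing window size $t/2$ as the weighting kernel $K_t(s)$:

$$K_t(s) = \max\left\{ 1 - \frac{2s}{t},\ 0 \right\}, \qquad s > 0. \tag{6}$$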

We chose the triangular kernel because it has properties
important for our application. First, the kernel discards all posts that are
older than $t/2$. In particular, it quickly discards the unstable and
potentially explosive period at the beginning, which if included,
would introduce an upward bias to $\hat{p}_t$. Second, the kernel takes into
account posts in a larger window as time $t$ increases.
According to our experiments, the growing window size helps to stabilize
$\hat{p}_t$ compared to a fixed window size. Third, for reshares within
the window, the kernel up-weights the most recent posts and
gradually down-weights older posts. This keeps our estimator $\hat{p}_t$ closer
to the ever-changing true $p_t$. And last, as $K_t(s)$ is piece-wise
linear, the integral $\int K_t(t - s)\, \phi(s - t_i)\, ds$ has a closed form for many
different functions $\phi(s)$, including the one we use for Twitter in our
experiments; see Section 4.3.
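
The kernel-weighted estimator with the triangular kernel can be sketched as follows. This is a hedged illustration rather than the authors' code. Its assumptions are: hypothetical function names, the original post at `times[0] = 0`, a memory-kernel constant derived from normalization, and a simple midpoint-rule quadrature standing in for the closed-form integral mentioned above.

```python
def triangular_kernel(s, t):
    """K_t(s) = max(1 - 2s/t, 0): weight of an event s time units before t."""
    return max(1.0 - 2.0 * s / t, 0.0)


def memory_kernel(s, s0=300.0, theta=0.242):
    """phi(s): plateau on (0, s0], power-law tail; c set by normalization."""
    c = theta / (s0 * (1.0 + theta))
    if s <= 0:
        return 0.0
    return c if s <= s0 else c * (s / s0) ** (-(1.0 + theta))


def weighted_exposure(t, ti, steps=2000):
    """Midpoint-rule value of the integral of K_t(t - s) * phi(s - ti) over [ti, t]."""
    lo = max(ti, t / 2.0)  # K_t(t - s) vanishes for s <= t/2
    if lo >= t:
        return 0.0
    h = (t - lo) / steps
    total = 0.0
    for k in range(steps):
        s = lo + (k + 0.5) * h
        total += triangular_kernel(t - s, t) * memory_kernel(s - ti)
    return total * h


def smoothed_infectiousness(t, times, degrees):
    """Weighted reshare count divided by weighted effective exposure (t > 0)."""
    observed = [(ti, n) for ti, n in zip(times, degrees) if ti <= t]
    numerator = sum(triangular_kernel(t - ti, t) for ti, _ in observed[1:])
    denominator = sum(n * weighted_exposure(t, ti) for ti, n in observed)
    return numerator / denominator
```

A reshare that happened exactly `t/2` ago (or earlier) receives weight zero, so only the most recent half of the observation window influences the estimate.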

4.2 Predicting final popularity 

Having described the procedure for inferring the post
infectiousness, we now need to account for the network structure in order to
predict how far the post is going to spread across the network.

For simplicity, let us assume the post is first posted at time 0, i.e.,
$t_0 = 0$. Suppose we have observed the post for $t$ time units and
our goal now is to predict the post’s final reshare count, $R_\infty$, based
on the information we have observed so far, $\mathcal{F}_t$.

The following proposition shows how to compute the expected
final reshare count of a post. The main idea is to model an
information cascade spreading over the network with a branching process
that counts the number of reshares of a post, as illustrated in Figure 2.
The predictor for $R_\infty$ used by Seismic can be stated as follows:









Algorithm 1 Seismic: Predict final cascade size

Purpose: For a given post at time $t$, predict its final reshare count
Input: Post resharing information: $t_i$ and $n_i$ for $i = 0, \ldots, R_t$.

Algorithm:

$N_t = 0$, $N_t^e = 0$

for $i = 0, \ldots, R_t$ do

    $N_t$ += $n_i$

    $N_t^e$ += $n_i \int_{t_i}^{t} \phi(s - t_i)\, ds$   (Sec. 3.1)

end for

$\hat{R}_\infty(t) = R_t + \alpha_t \hat{p}_t (N_t - N_t^e) / (1 - \gamma_t \hat{p}_t n_*)$   ($\hat{p}_t$ from Alg. 2)

Deliver: $\hat{R}_\infty(t)$
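
The final step of Algorithm 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name and argument names are hypothetical, the inputs (reshare count, cumulative degree, effective cumulative degree, and estimated infectiousness) are assumed to have been computed already, and the default correction factors of 1.0 correspond to the uncorrected formula.

```python
import math


def predict_final_size(reshares, n_cum, n_eff, p_hat, n_star,
                       alpha=1.0, gamma=1.0):
    """R_t + alpha * p_hat * (N_t - N_t^e) / (1 - gamma * p_hat * n_star).

    Returns infinity when gamma * p_hat * n_star >= 1, i.e. when the cascade
    is estimated to be supercritical and no finite prediction exists.
    """
    branching = gamma * p_hat * n_star
    if branching >= 1.0:
        return math.inf
    return reshares + alpha * p_hat * (n_cum - n_eff) / (1.0 - branching)
```

For instance, with 100 observed reshares, `n_cum = 50000`, `n_eff = 40000`, `p_hat = 0.01` and `n_star = 20`, the remaining 10000 not-yet-reacted followers are expected to produce 100 direct reshares, which the factor 1/(1 - 0.2) inflates to 125 to account for later generations.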


Next, consider the case where $p = p_t \ge 1/n_*$. In this regime,
the point process is supercritical and stays explosive. In terms of
the Galton-Watson tree discussed above, the offspring expectation is
$\mu = n_* p \ge 1$, so $\mathbb{E}[Z_{k+1}] \ge \mathbb{E}[Z_k] \ge \cdots \ge \mathbb{E}[Z_1]$. Therefore
the total number of future reshares $\sum_{k=1}^{\infty} Z_k$ has infinite expectation and the
final reshare count cannot be reliably predicted. □

Figure 2: An illustration of the information diffusion tree. We
observe the cascade up to time $t$ (denoted by a dashed line)
and the question is how the cascade tree is going to grow in
the future. We define variables $Z_k$ which denote the number
of reshares caused by the $k$-th generation descendants. Using
variables $Z_k$, the final reshare count $R_\infty$ can then be simply
computed as $R_t + \sum_{k=1}^{\infty} Z_k$.

Note that Prop. 4.1 assumes that the post infectiousness remains
constant in the future ($p_s = p_t$ for $s \ge t$), which could be
unrealistic for some information cascades. We correct this by changing
the prediction formula in Eq. (7) by adding two scaling constants
$\alpha_t$, $\gamma_t$ that adjust the final prediction:


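Proposition 4.1. Assume the (out-)degrees in the network are
i.i.d. with expectation $n_*$ and the infectiousness parameter $p_s$ is a
constant $p$ for $s \ge t$. Then, we have

$$\mathbb{E}[R_\infty \mid \mathcal{F}_t] = \begin{cases} R_t + \dfrac{p\,(N_t - N_t^e)}{1 - p\, n_*}, & \text{if } p < \dfrac{1}{n_*}, \\[2ex] \infty, & \text{if } p \ge \dfrac{1}{n_*}. \end{cases} \tag{7}$$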


Proof. First, we consider the case where $p < 1/n_*$. We define
a sequence of random variables $\{Z_1, Z_2, Z_3, \ldots\}$ that models the
future information diffusion tree, as illustrated in Figure 2. In this
tree, $Z_k$ denotes the number of reshares made by the $k$-th
generation descendants (counting from generation $R_t$ onward). Thus, the
1st generation descendants $Z_1$ refers to the number of new reshares
generated by the posts created before time $t$, the 2nd generation
descendants $Z_2$ refers to the reshares of the posts of the 1st generation
descendants, and so on. Notice that the summation over the $Z_k$'s gives the
post’s final reshare count $R_\infty = R_t + \sum_{k=1}^{\infty} Z_k$. In the following
we use descendants $Z_k$ only for deriving Eq. (7) and emphasize
that our final estimator does not require explicit network structure
information.

Given $Z_1$, the sequence of random variables $Z_k$ defines a
Galton-Watson tree with the offspring expectation $\mu = n_* p$. Here, $\mu$
denotes the expected number of reshares that a post gets. Using a
standard branching process result, we have that $Z_k/\mu^k$ is a martingale.
Therefore, $\forall k \ge 1$, $\mathbb{E}[Z_{k+1} \mid Z_k] = \mu Z_k$, and


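$$\mathbb{E}\left[ \sum_{k=1}^{\infty} Z_k \,\middle|\, Z_1 \right] = \sum_{k=1}^{\infty} Z_1\, \mu^{k-1} = \frac{Z_1}{1 - \mu} = \frac{Z_1}{1 - n_* p}.$$

Hence, we obtain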

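$$\mathbb{E}[R_\infty \mid \mathcal{F}_t] = R_t + \mathbb{E}\left[ \sum_{k=1}^{\infty} Z_k \,\middle|\, \mathcal{F}_t \right] = R_t + \frac{\mathbb{E}[Z_1 \mid \mathcal{F}_t]}{1 - n_* p},$$

which is the right hand side of Eq. (7) because $\mathbb{E}[Z_1 \mid \mathcal{F}_t] =
p\,(N_t - N_t^e)$ by the definition of $Z_1$ and $N_t^e$.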
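
One way to see the last identity, sketched here under the assumptions that the memory kernel integrates to one and that the infectiousness stays at $p$ after time $t$: the first-generation reshares are produced by followers of the posts observed so far who have not yet reacted, so integrating the intensity contributed by those posts over the future gives

```latex
\begin{aligned}
\mathbb{E}[Z_1 \mid \mathcal{F}_t]
  &= \int_{t}^{\infty} p \sum_{i=0}^{R_t} n_i\, \phi(s - t_i)\, ds \\
  &= p \sum_{i=0}^{R_t} n_i \left( 1 - \int_{t_i}^{t} \phi(s - t_i)\, ds \right) \\
  &= p\,(N_t - N_t^e).
\end{aligned}
```

The second line uses $\int_{t_i}^{\infty} \phi(s - t_i)\, ds = 1$, and the third uses the definitions of $N_t$ and $N_t^e$ from Table 1.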


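$$\hat{R}_\infty(t) = R_t + \alpha_t\, \frac{\hat{p}_t\,(N_t - N_t^e)}{1 - \gamma_t\, \hat{p}_t\, n_*}, \qquad 0 < \alpha_t, \gamma_t < 1. \tag{8}$$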

We introduce these correction factors based on the following
intuition. We expect $\alpha_t$ to decrease over time $t$ so it scales down the
estimated infectiousness in the future, which accounts for the post
getting stale and outdated. Similarly, $\gamma_t$ accounts for the overlap
in the neighborhoods of reposters’ followers. Over time as the post
spreads farther in the network, we expect $\gamma_t$ to increase as more
nodes get exposed multiple times, which means the arrival rate of
new nodes (previously unexposed nodes) decreases over time.

We use the same values of $\alpha_t$ and $\gamma_t$ for all posts but allow them
to vary over time. The values of $\alpha_t$ and $\gamma_t$ are selected to
minimize the median Absolute Percentage Error (refer to Section 5.4 for
the definition) on a training data set. As described in Section 5.2, we
find $\alpha_t$ is more important than $\gamma_t$ in practice.

4.3 The SEISMIC algorithm 

Last, we put together all the components described so far and
synthesize them in the SEISMIC algorithm. The SEISMIC algorithm
for predicting $\hat{R}_\infty(t)$ is described in Algorithm 1, which uses the
algorithm for computing $\hat{p}_t$ (Algorithm 2) as a subroutine. These
algorithms are based on Eqs. (5) and (8). We assume the parameters
$\phi(s)$, $K_t(s)$, $\alpha_t$, $\gamma_t$, and $n_*$ are given a priori or estimated from the data.

Computational complexity of Seismic. For any choice of $\phi(s)$
and $K_t(s)$, the computational cost of Seismic is $O(R_t)$ for both
calculating $\hat{p}_t$ and predicting $\hat{R}_\infty(t)$. Of course, the actual
computing time depends heavily on evaluating the integrals
$\int_{t_i}^{t} K_t(t - s)\, \phi(s - t_i)\, ds$ and $\int_{t_i}^{t} \phi(s - t_i)\, ds$.
However, the overall computational cost of
Seismic is linear in the observed number of reshares $R_t$ of a given
post by time $t$.

The linear time complexity is in part also due to the shape of our
memory kernel. In Section 5.2 we will estimate the memory kernel
$\phi(s)$ for Twitter to have the following form (for some $s_0 > 0$):

$$\phi(s) = \begin{cases} c, & \text{if } 0 < s \le s_0, \\ c\,(s/s_0)^{-(1+\theta)}, & \text{if } s > s_0. \end{cases} \tag{9}$$


























This means that with the memory kernel $\phi(s)$ in Eq. (9) and the
triangular weighting kernel $K_t(s)$ in Eq. (6), both integrals can be
evaluated in closed form because the integrands are piece-wise polynomials
(polynomials with possibly non-integer exponents), which greatly
decreases the computational cost of Seismic.

Algorithm 2 Compute real-time infectiousness $\hat{p}_t$

Purpose: For a given post $w$, calculate infectiousness $p_t$ with
information about $w$ prior to time $t$

Input: Post resharing information: $t_i$ and $n_i$ for $i = 0, \ldots, R_t$.

Algorithm:

$\tilde{R}_t = 0$, $\tilde{N}_t^e = 0$

for $i = 1, \ldots, R_t$ do

    $\tilde{R}_t$ += $K_t(t - t_i)$

end for

for $i = 0, \ldots, R_t$ do

    $\tilde{N}_t^e$ += $n_i \int_{t_i}^{t} K_t(t - s)\, \phi(s - t_i)\, ds$

end for

$\hat{p}_t = \tilde{R}_t / \tilde{N}_t^e$

Deliver: $\hat{p}_t$

5. EXPERIMENTS 

In this section, we describe the Twitter data set, our parameter 
estimation procedure, and compare the performance of Seismic to 
state-of-the-art approaches. 

5.1 Data description and data processing 

Our data is the complete set of over 3.2 billion tweets and retweets
on Twitter from October 7 to November 7, 2011. For each retweet,
the dataset includes tweet id, posting time, retweet time, and the
number of followers of the poster/retweeter. Note that the data set lacks
Twitter network information. The only piece of network
information available to us is the number of followers of a node.

We focus on a subset of reasonably popular tweets with at least
50 retweets, so that our model enables the prediction as soon as
a sufficient number of retweets occur. Note that if multiple Twitter
users independently post the same tweet, which then gets retweeted,
each original posting creates its own independent cascade. All in
all there are 166,076 tweets satisfying this criterion in the first 15
days. We form the training set using the tweets from the first 7
days and the test set using the tweets from the next 8 days. We
use the remaining 14 days for the retweet cascades to unfold and
evolve. For a particular retweet cascade, we obtain all the retweets
posted within 14 days of the original post time, i.e., we
approximate $R_\infty$ by $R_{14\,\text{days}}$. We estimate the parameters $\phi(s)$, $\alpha_t$, $\gamma_t$ and $n_*$
with the training set, and evaluate the performance of the estimator
$\hat{R}_\infty$ on the test set. For the tweets in our training set, $R_{14\,\text{days}}$ has
mean 209.8 and median 110. The temporal evolution of the mean and
median of $R_t$ is also shown in Figure 3.

5.2 SEISMIC parameter estimation 

First we describe how to fit the memory kernel $\phi(s)$ (Section 3.1).
We carefully choose 15 tweets in the training set and use the
distribution of all their retweet times as our $\phi(s)$ (Figure 4). The
histograms of the 15 sequences of retweet times all display a clear
shape of subcritical decay. Moreover, all the original posters have
an overwhelming number of followers. Therefore, most of the
retweets, if not all, should come from the immediate followers of
the original poster. Consequently, the distribution of human
reaction time can be well approximated by that of the retweet times of
these 15 tweets. The estimation of $\phi(s)$ can be further improved if
the network structure is available.

Figure 3: Convergence of the mean and median cumulative
retweet count $R_t$ as a function of time. The horizontal lines
correspond to the mean and median final retweet count $R_{14\,\text{days}}$. On
average, a tweet receives 75% of its retweets in the first 6 hours.

Figure 4: Reaction time distribution (observed) and the estimated memory
kernel $\phi(s)$ (fitted). The reaction time is plotted on logarithmic axes,
hence the linear trend suggests a power-law decay.

The observed reaction time distribution (Figure 4) suggests the form
of the memory kernel defined earlier: constant in the first 5 minutes,
followed by a power-law decay. After setting the constant period
$s_0$ to 5 minutes, we estimate the power-law decay parameter
$\theta = 0.242$ from the complementary cumulative distribution
function (ccdf), and choose $c = 6.27 \times 10^{-4}$ to make
$\int_0^\infty \phi(s)\,ds = 1$. The memory kernel is a network-wide
parameter and only needs to be estimated once. The fitted memory kernel
is plotted in Figure 4.
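To make the normalization step concrete, here is a minimal sketch. It assumes the kernel is $\phi(s) = c$ for $0 < s \le s_0$ and $\phi(s) = c\,(s/s_0)^{-(1+\theta)}$ for $s > s_0$, with time measured in seconds, so that the ccdf decays with exponent $\theta$. The function names are hypothetical. Under this assumed form the flat part integrates to $c\,s_0$ and the tail to $c\,s_0/\theta$.

```python
def normalizing_constant(s0: float, theta: float) -> float:
    """Constant c that makes the assumed kernel integrate to 1 over (0, infinity)."""
    # Flat part contributes c * s0; power-law tail contributes c * s0 / theta.
    return 1.0 / (s0 * (1.0 + 1.0 / theta))


def memory_kernel(s: float, s0: float = 300.0, theta: float = 0.242) -> float:
    """Assumed kernel: constant up to s0 seconds, then power-law decay."""
    if s <= 0:
        return 0.0
    c = normalizing_constant(s0, theta)
    if s <= s0:
        return c
    return c * (s / s0) ** (-(1.0 + theta))


def kernel_mass(upper: float, s0: float = 300.0, theta: float = 0.242) -> float:
    """Closed-form integral of the assumed kernel from 0 to `upper` seconds."""
    c = normalizing_constant(s0, theta)
    if upper <= s0:
        return c * max(upper, 0.0)
    return c * s0 + c * s0 / theta * (1.0 - (upper / s0) ** (-theta))


if __name__ == "__main__":
    print(f"c = {normalizing_constant(300.0, 0.242):.3e}")
    print(f"mass within 6 hours = {kernel_mass(6 * 3600.0):.3f}")
```

With $s_0 = 300$ s and $\theta = 0.242$, this closed form gives a constant of roughly $6.5 \times 10^{-4}$, the same order of magnitude as the reported value. The small gap may come from rounding or from fitting details not stated here.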

Last, we briefly comment on the correction factors $\alpha_t$ and
$\gamma_t$ introduced in the calibrated prediction formula. We use the
same values of $\alpha_t$ and $\gamma_t$ for all tweets. Notice that
$\gamma_t$ and $n_*$ only affect the predictions through their product
$\gamma_t n_*$. Overall, we find that the value of $\gamma_t n_*$ has
little effect on the performance of our algorithm. In our experiments
we simply set $\gamma_t n_* = 20$ for all $t$. We choose the value of
$\alpha_t$ that minimizes the training median Absolute Percentage Error
(Section 5.4). We report the values of $\alpha_t$ in Table 2.
$\alpha_t$ has a particularly small value at $t = 5$ minutes, which may
be a result of the overestimation of $p_t$ when the triangular kernel
has not yet moved away from the unstable beginning period. After that,
$\alpha_t$ begins a slow and consistent decay to account for the fact
that information is getting increasingly stale and outdated over time.

time (minute)    5      10     15     20     30
$\alpha_t$       0.389  0.803  0.772  0.709  0.680

time (minute)    60     120    180    240    360
$\alpha_t$       0.562  0.454  0.378  0.352  0.326

Table 2: Values of $\alpha_t$ used in the Seismic prediction algorithm.
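The choice of $\alpha_t$ by minimizing the training median APE can be illustrated with a simple grid search. The sketch below assumes that $\alpha_t$ enters as a multiplicative factor on the model's predicted number of future retweets, so that the calibrated prediction is $R_t + \alpha_t \cdot \text{increment}$. The row layout, the grid and the function names are hypothetical.

```python
from statistics import median
from typing import Iterable, List, Optional, Tuple

# One row per training tweet at a fixed prediction time t:
# (retweets observed so far, uncalibrated predicted future retweets, final count)
Row = Tuple[float, float, float]


def median_ape(alpha: float, rows: Iterable[Row]) -> float:
    """Median absolute percentage error of the calibrated prediction."""
    return median(abs(r_t + alpha * inc - r_inf) / r_inf
                  for r_t, inc, r_inf in rows)


def fit_alpha(rows: List[Row], grid: Optional[List[float]] = None) -> float:
    """Pick the grid value of alpha with the smallest training median APE."""
    if grid is None:
        grid = [i / 1000 for i in range(0, 1001)]  # 0.000, 0.001, ..., 1.000
    return min(grid, key=lambda a: median_ape(a, rows))
```

Running `fit_alpha` once per prediction time (5, 10, 15 minutes, and so on) would produce one value per column, in the spirit of Table 2. A grid search is used here only because the median APE is not smooth in $\alpha_t$.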

With all the parameters of Seismic estimated, we are ready to
apply its algorithms to the Twitter dataset. For a given tweet $w$
and every 5-minute interval $t$, we output our estimate
$\hat{R}_\infty(t, w)$ of the tweet's final retweet count $R_\infty(w)$.


5.3 Baselines for comparison 

We consider four different prediction methods for comparison.
The first two are regression based and the next two are point process
based.

• Linear regression (LR) (31): The model can be defined as

$$\log R_\infty = \alpha_t + \log R_t + \epsilon,$$

where $\epsilon$ denotes Gaussian noise. This is also the second
baseline estimator used in (34). Notice that all the tweets
receive the same multiplicative constant for a given time.

• Linear regression with degree (LR-D) (31): This model
can be written as

$$\log R_\infty = \alpha_t + \beta_{1,t} \log R_t + \beta_{2,t} \log N_t + \beta_{3,t} \log n_0 + \epsilon,$$

where $\epsilon$ denotes, as before, Gaussian noise. LR-D is more
flexible than LR, since it allows $\log R_t$ to have a slope not
equal to 1 and uses additional features.

• Dynamic Poisson Model (DPM): It models the retweet
times $\{t_k\}$ as a point process with rate

$$\lambda_t = \lambda_{t_{peak}} (t - t_{peak})^{\gamma},$$

where $t_{peak} = \arg\max_{s<t} \lambda_s$. The power-law parameter
$\gamma$ is estimated separately for each tweet. To discretize the
model, we bin the retweet times into 5-minute intervals. Note
that when $\gamma > -1$, the integral $\int \lambda_t\,dt$ is infinite.
In such cases, we move $t_{peak}$ forward to the second maximum
bin.

• Reinforced Poisson Model (RPM) (l2): This recently published
state-of-the-art approach models the reshare rate as

$$\lambda_t = c\, f_\gamma(t)\, r_\alpha(R_t),$$

where the parameter $c$ measures the attractiveness of the message,
$f_\gamma(t) \propto t^{-\gamma}$ $(\gamma > 0)$ models the aging
effect, and $r_\alpha(R_t)$ $(\alpha > 0)$ is the reinforcement
function which depicts the “rich get richer” phenomenon. Given a
particular tweet, the parameters $c$, $\gamma$, $\alpha$ are found by
maximizing the likelihood function, where the optimal values are
projected to their feasible sets whenever they are out of range.
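As an illustration of the simplest baseline in the list above, the LR model with unit slope can be fitted in closed form: the least-squares estimate of $\alpha_t$ is the mean of $\log R_\infty - \log R_t$ over the training tweets at a fixed time $t$. The sketch below is a hedged reading of that model, not the implementation used in the experiments. The function names are hypothetical, and the prediction ignores any correction for the noise term when transforming back from the log scale.

```python
import math
from typing import Iterable, Tuple


def fit_lr_offset(train_pairs: Iterable[Tuple[float, float]]) -> float:
    """Least-squares alpha_t for log R_inf = alpha_t + log R_t + noise.

    train_pairs holds (R_t, R_inf) for one prediction time t; both must be positive.
    """
    diffs = [math.log(r_inf) - math.log(r_t) for r_t, r_inf in train_pairs]
    return sum(diffs) / len(diffs)


def predict_lr(alpha_t: float, r_t: float) -> float:
    """Every tweet gets the same multiplicative constant exp(alpha_t)."""
    return math.exp(alpha_t) * r_t
```

For example, if every training cascade doubles between time $t$ and its final size, the fitted offset is $\log 2$ and a tweet with 100 retweets at time $t$ is predicted to reach about 200.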


5.4 Evaluation metrics 

For a particular tweet, suppose that the prediction for $R_\infty$ at
time $t$ is denoted by $\hat{R}_\infty(t)$. We use the following
evaluation metrics in our experiment:

• Absolute Percentage Error (APE): For a given tweet $w$ and
a prediction time $t$, the APE metric is defined as

$$\mathrm{APE}(w, t) = \frac{|\hat{R}_\infty(w, t) - R_\infty(w)|}{R_\infty(w)}.$$

When the APE metric is used for evaluation purposes, various
quantiles of APE over the tweets (all possible $w$) in the
test dataset will be reported at each time $t$.

• Kendall-$\tau$ Rank Correlation: This is a measure of rank
correlation, which computes the correlation between the
ranks of $\hat{R}_\infty(t)$ and $R_\infty$ for all test tweets. This
metric is generally more robust than Pearson’s correlation of the
values of $\hat{R}_\infty(t)$ and $R_\infty$. A high value of rank
correlation means the predicted and the final retweet counts are
strongly correlated.

• Breakout Tweet Coverage: We create a ground-truth list of
top-$k$ tweets with the highest final retweet count. We refer to
these tweets as “breakout” tweets. Using our model we can
also produce a top-$k$ list based on the predicted final retweet
count. We evaluate the methods by quantifying how well the
predicted top-$k$ list covers the ground-truth top-$k$ list. We
give additional details in Section 5.5.3.
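The first two metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the quantile uses the nearest-rank convention and the rank correlation is the tau-a variant, which ignores ties. The text does not say which conventions were used, and all function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def ape(pred: float, true: float) -> float:
    """Absolute percentage error of one prediction."""
    return abs(pred - true) / true


def nearest_rank_quantile(values: Sequence[float], q: float) -> float:
    """Quantile by the nearest-rank rule (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant - discordant) pairs divided by the total number of pairs."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)
```

The pairwise loop is quadratic in the number of tweets, which is fine for a sketch. For a test set of this size, a library routine with an O(n log n) algorithm would be the practical choice.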

5.5 Experimental results 

In this section we evaluate the performance of Seismic and the
four competitors described in Section 5.3. All the methods start
making predictions as soon as a given tweet has been retweeted 50
times.

5.5.1 SEISMIC model validation

First, we empirically validate Seismic. In Proposition 4.1 we
obtain a formula for the expected number of final retweets in terms
of the infectiousness parameter $p_t$. Our goal here is to show that
Proposition 4.1 provides an unbiased estimate of the true final retweet
count. We proceed as follows. We use Seismic to make a prediction
after observing each tweet for 1 hour and then plot the prediction
against the true final number of retweets. If Seismic gives an
unbiased estimate, then we expect a diagonal curve $y = x$, that is,
the expected predicted $R_\infty$ matches the true expected $R_\infty$.

Figure 5 shows that the empirical average almost perfectly coincides
with Seismic’s prediction. This suggests that the Seismic
estimator is unbiased and that we can safely use it to predict
the expected final number of retweets. However, as mentioned earlier,
in practice one often wants to shrink the prediction in order
to stabilize the estimator and achieve better overall performance.
Therefore, we use the calibrated prediction formula for the
rest of the experiments.
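The binned comparison used for this validation can be sketched as follows. The sketch assumes logarithmically spaced bins on the predicted count, because the text only says that tweets are binned by prediction. The bin width and the function name are hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def binned_calibration(preds: Sequence[float],
                       trues: Sequence[float],
                       bins_per_decade: int = 4) -> List[Tuple[float, float]]:
    """Group tweets by predicted count; return (mean prediction, mean truth) per bin."""
    groups = defaultdict(list)
    for p, r in zip(preds, trues):
        # Assumption: log-spaced bins, since retweet counts are heavy tailed.
        groups[math.floor(math.log10(p) * bins_per_decade)].append((p, r))
    curve = []
    for key in sorted(groups):
        ps, rs = zip(*groups[key])
        curve.append((sum(ps) / len(ps), sum(rs) / len(rs)))
    return curve
```

If the estimator is unbiased, the returned points should lie close to the line $y = x$. Systematic deviation above or below that line would indicate under- or over-prediction in that range of cascade sizes.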

5.5.2 Predicting final retweet count

We run our Seismic method for each tweet and compute the
Absolute Percentage Error (APE) as a function of time. We plot
the quantiles of the distribution of APE of Seismic in Figure 6.
After observing the cascade for 10 minutes ($t = 10$ min), the 95th,
75th, and 50th percentiles of APE are less than 71%, 44%, and
25%, respectively. This means that after 10 minutes, the error
is less than 25% for 50% of the tweets and less than 71% for 95%
of the tweets. After 1 hour the error gets even lower: the APE for
95%, 75% and 50% of the tweets drops to 62%, 30% and 15%,
respectively.

The proposed method, Seismic, demonstrates a clear improvement
over the baselines and the state of the art, as shown in Figures
7 and 8. The left panels of Figures 7 and 8 show the median APE of
different methods over time as more and more of the retweet cascade
gets revealed. The LR and LR-D baselines have very similar
performances, indicating that the additional features used by LR-D are
not very informative. DPM performs poorly across the entire tweet
lifetime, while the other point process approach, RPM, is worse than
LR and LR-D in the early period but becomes better after about 2
hours. All in all, in terms of median APE score, Seismic is about
30% more accurate than all the competitors across the entire tweet
lifetime.

Similarly, the right panels of Figures 7 and 8 show the Kendall-$\tau$
rank correlation between the predicted ranking of the most retweeted
tweets and the ground-truth ranking of tweets. Again, Seismic
gives much more accurate rankings than the other methods.

Figure 5: Predicted final retweet counts nicely follow the
ground-truth retweet counts, which suggests Seismic provides
an unbiased estimate of the final retweet count. The dashed red
curve is obtained by binning the tweets according to the prediction
and then computing the average number of retweets in
each bin.

Figure 6: Absolute Percentage Error (APE) of Seismic on the
test set. We plot the median and the middle 50th, 80th, 90th
percentiles of the distribution of APE across the tweets.

Figure 7: Median Absolute Percentage Error (APE) and
Kendall’s Rank Correlation of Seismic and the baselines as a
function of time. Seismic consistently gives the best performance.

Figure 8: Zoom-in of Figure 7. Median APE and Rank Correlation
for the first 60 minutes after the tweet was posted. Seismic
performs especially well compared to the baselines early in
the tweet’s lifetime.

5.5.3 Identifying breakout tweets 

Can we identify a breakout tweet before it receives most of its
retweets? This question arises in various applications such as trend
forecasting or rumor detection. The goal of this prediction task is
to identify, as early as possible, “breakout” tweets, which have the
highest final retweet counts. We quantify the performance of
different models in detecting breakout tweets by using the models’
predictions of the tweets’ final retweet counts.

First, we form a ground-truth set $L^*_M$ of size $M$. The set
$L^*_M$ contains the top-$M$ tweets with the highest final retweet
counts. Then, with each of the prediction methods, we produce a
sequence of size-$m$ lists $L_m(t)$. At each time $t$, the list
$L_m(t)$ contains the top-$m$ tweets with the highest predicted
retweet counts at time $t$. As described in Section 5.4, we then
compare each $L_m(t)$ with $L^*_M$ and calculate the Breakout Tweet
Coverage, which is defined as the proportion of tweets in $L^*_M$
covered by $L_m(t)$.
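The coverage computation just described can be sketched as follows. The dictionaries keyed by tweet id, the tie-breaking rule and the function names are assumptions made for the example.

```python
from typing import Dict, Set


def top_ids(scores: Dict[str, float], k: int) -> Set[str]:
    """Ids of the k highest-scoring tweets (ties broken by id, an arbitrary choice)."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:k])


def breakout_coverage(final_counts: Dict[str, float],
                      predicted_counts: Dict[str, float],
                      M: int, m: int) -> float:
    """Fraction of the ground-truth top-M set that appears in the predicted top-m list."""
    truth = top_ids(final_counts, M)
    predicted = top_ids(predicted_counts, m)
    return len(truth & predicted) / len(truth)
```

Calling `breakout_coverage` with the predictions available at each time $t$ yields one coverage value per time step. Choosing $m$ larger than $M$ gives the predicted list some slack, so a breakout tweet counts as covered even if it is not ranked at the very top.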

Figure 9 shows the performance of Seismic in detecting the top 100
most retweeted tweets ($L^*_{100}$) as a function of time. Seismic is
able to cover 82 tweets in the first hour and 93 tweets in the first
6 hours.

The fifth most retweeted tweet in this plot is actually the tweet
we showed in an earlier figure. We observe that Seismic detects
this tweet 30 minutes after it has been posted, while LR and LR-D
both take more than an hour. DPM fails to detect this breakout for
the first 6 hours (plots not shown for brevity).

To compare Seismic with other methods, we keep the size of
the predicted lists at $m = 500$, and use a larger target list
$L^*_{500}$, which is a more difficult task than finding $L^*_{100}$.
Figure 10 compares the coverage of different methods against the
proportion of retweets seen. After seeing 20% of the retweets, Seismic
covers 65% of the shortlist, while LR-D and LR both cover only 50%. In
general, the dynamic Poisson model fails to provide accurate
predictions and breakout identifications.

Figure 9: Coverage of top 100 most retweeted tweets. Each row
represents a tweet. White blocks indicate that a given tweet
was not covered by Seismic’s predicted list of top-500 tweets
at time $t$, and blue indicates successful coverage.

Overall, Seismic allows for effective detection of breakout tweets.
For instance, after seeing around 25% of the total number of retweets
of a given tweet (in other words, after observing a tweet for around
5 minutes), Seismic can identify 60% of the top-100 tweets
according to the final retweet counts.

5.6 Discussion of model robustness 

Seismic demonstrates better robustness than the other two point
process based methods, DPM and RPM. While Seismic is not
able to make a prediction for tweets that are in the supercritical
state, DPM and RPM are unable to make predictions when the decay
parameter falls outside the feasible set (the feasible values are
$\gamma < -1$ for DPM, and $\gamma > 0$ and $\alpha > 0$ for RPM).
For example, for the tweet shown in an earlier figure, Seismic
characterizes the tweet as supercritical for the first 70 minutes, DPM
fails to make a prediction for the first 6 hours, and RPM is only able
to make a prediction from minute 30 to minute 80.

All in all, we find that tweets are in the supercritical regime for
only a very short time and Seismic is able to make predictions
for most of the tweets most of the time. We find that, on average,
Seismic is not able to make a prediction for 1.80% of the
tweets after observing them for 15 minutes. In other words, after
15 minutes, 1.80% of the tweets are still in the supercritical regime
(over all the tweets with at least 50 retweets). This number drops to
1.29% (0.67%) after 1 hour (6 hours). As a point of comparison, we
also note that the other methods are unable to make predictions for
a much larger fraction of tweets: DPM fails to make a prediction
for 6.77%, 5.79% and 1.45%, and RPM fails for 3.45%, 5.69% and
15.43% of the tweets after 15 minutes, 1 hour, and 6 hours.

Figure 10: Coverage of top 500 tweets ($L^*_{500}$) by various
methods. Seismic exhibits clear improvement over all methods after
about 10% of retweets are observed. All methods except DPM
achieve perfect coverage after 65% of retweets are observed.

Our Seismic method is also significantly faster than the RPM
model, which requires solving a nonlinear optimization problem
every time it predicts. In our implementation, the average running
time per tweet for predicting every 5 minutes for 6 hours
is 0.02s for Seismic and 3.6s for RPM. The reported running time
includes both parameter learning and prediction.

6. CONCLUSION AND FUTURE WORK 

In this paper we propose Seismic, a flexible framework for modeling
information cascades and predicting the final size of an information
cascade. Our contributions are as follows:

• We model the information cascades as self-exciting point
processes on Galton-Watson trees. Our approach provides
a theoretical framework for explaining temporal patterns of
information cascades.

• Seismic is both scalable and accurate. The model requires
no feature engineering and scales linearly with the number
of observed reshares of a given post. This provides a way to
predict information spread for millions of posts in an online
real-time setting.

• Seismic brings extra flexibility to estimation and prediction
tasks as it requires minimal knowledge about the information
cascade as well as the underlying network structure.

There are many interesting avenues for future work, and our proposed
model can be extended in many different directions. For
example, if the network structure is available, one could replace
the node degree $n_i$ by the number of newly exposed followers. If
content-based features or features of the original post are available,
one could develop a content-based prior of $p_t$ for each post. If
temporal features such as the users’ time zone are available, one could
directly use them to modify the estimator $p_t$. In this sense, the
proposed model provides an extensible framework for predicting
information cascades.

Seismic is a statistically sound and scalable bottom-up model of
information cascades that allows for predicting the final cascade size
as the cascade unfolds over the network. We hope that our framework
will prove useful for developing a richer understanding of cascading
behaviors in online networks, paving the way towards better management
of shared content.

Data and Software 

The Seismic software and the dataset we use in Section 5 can
be found at http://snap.stanford.edu/seismic/. The
R package of our algorithm is also available at
http://cran.r-project.org/web/packages/seismic